Regionalisation with Spatially Constrained Cluster Analysis

Published

December 5, 2022

case study : Regionalisation by Multivariate Water Point Attributes with Non-spatially Constrained and Spatially Constrained Clustering Methods.

1. OVERVIEW

Regionalisation with Spatially Constrained clustering analysis requires similar observations to be grouped according to their statistical attributes and spatial location.

This study focuses on regionalising analysis based on Nigeria’s water points attributes.

1.1 Objectives

Regionalise Nigeria by using the following measures :

  • Total number of water points by status, i.e. functional, non-functional, and unknown;

  • Percentage of water points by :

    • status (functional, non-functional, and unknown);

    • deployed water technology (hand pump, mechanical pump, stand tap, etc.) ;

    • usage capacity (1000, 300, 250, 50);

    • rural or urban.

1.2 Scope of Works

Some of the specific tasks for this study are :

  • import the shapefile into R with the appropriate sf method, and save it in a simple feature data frame format;
note

Three (3) Projected Coordinate Systems of Nigeria, EPSG : 26391, 26392, and 26303.

  • derive the proportion of functional and non-functional water points at LGA level (i.e. ADM2) by appropriate tidyr and dplyr methods;

  • combine geospatial and aspatial data frames into a simple feature data frame.

  • delineate water points measures functional regions by using :

    • conventional hierarchical clustering.

    • spatially constrained clustering algorithms.

  • plot two (2) main types of maps below :

    Thematic Mapping

    Show the derived water-point measures by appropriate statistical graphics and choropleth mapping technique.

    Analytical Mapping

    Plot delineated functional regions using non-spatially constrained and spatially constrained clustering algorithms.


2. R PACKAGE REQUIRED

The following are the packages required for this exercise :

2.1 Load R Packages into R Environment

Usage of the code chunk below :

p_load( ) - pacman - to load packages. This function will attempt to install the package from CRAN or pacman repository list if its found not installed.

pacman::p_load(sf, tidyverse, questionr, janitor, psych, ggplot2, gcookbook, tmap, ggpubr, egg, corrplot, gtsummary, regclass, caret, heatmaply, ggdendro, cluster, factoextra, spdep, ClustGeo, GGally)


3. GEOSPATIAL DATA

3.1 Acquire Data Source

Note

The file size of the downloaded data is about 422 MB due to water points data from multiple countries.

To avoid any error of pushing files larger than 100 MB to Git cached in the Git Push function, filtered the water points for Nigeria and removed unnecessary variables. As a result, the file size is reduced to about 23 MB.

3.2 Import Aspatial Data

3.2.4 Create Master File

Usage of the code chunk below :

left_join( ) - dplyr - to combine wp_coord, wp_cond and wp_adm.

wp <- left_join(
  
  (left_join
   (wp_coord,wp_cond,
     by = c("row_id")
     )
   ),
  wp_adm, by = c("row_id"))

3.2.5 Convert Well Known Text (WKT) Data to SF Data Frame

  • The “New Georeferenced Column” in wp_rds contains spatial data in a WKT format.

  • Two (2) steps to convert the WKT data format into an sf data frame.

3.2.5.1 derive new field :: “geometry”

Usage of the code chunk below :

st_as_sfc( ) - sf - to convert foreign geometry object `New Georeferenced Column` to an sfc object

wp$geometry = st_as_sfc(wp$`New Georeferenced Column`)

3.2.5.2 convert to SF Data Frame

Usage of the code chunk below :

st_sf( ) - sf - to convert the tibble data frame into sf data frame with crs first set to WGS 84 (EPSG : 4326).

st_crs( ) - sf - to retrieve coordinate reference system from the object.

wp_sf<- st_sf(wp, crs = 4326)
st_crs(wp_sf)
Coordinate Reference System:
  User input: EPSG:4326 
  wkt:
GEOGCRS["WGS 84",
    ENSEMBLE["World Geodetic System 1984 ensemble",
        MEMBER["World Geodetic System 1984 (Transit)"],
        MEMBER["World Geodetic System 1984 (G730)"],
        MEMBER["World Geodetic System 1984 (G873)"],
        MEMBER["World Geodetic System 1984 (G1150)"],
        MEMBER["World Geodetic System 1984 (G1674)"],
        MEMBER["World Geodetic System 1984 (G1762)"],
        MEMBER["World Geodetic System 1984 (G2139)"],
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]],
        ENSEMBLEACCURACY[2.0]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["Horizontal component of 3D system."],
        AREA["World."],
        BBOX[-90,-180,90,180]],
    ID["EPSG",4326]]

3.2.5.3 retrieve geometry summary :: wp_sf

Usage of the code chunk below :

st_geometry( ) - sf - to get the geometry summary.

st_geometry(wp_sf)
Geometry set for 95008 features 
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 2.707441 ymin: 4.301812 xmax: 14.21828 ymax: 13.86568
Geodetic CRS:  WGS 84
First 5 geometries:
POINT (6.95009 6.78599)
POINT (7.604793 6.78321)
POINT (7.60024 6.759284)
POINT (7.615451 6.799595)
POINT (7.65991 6.762375)

3.3 Import Boundary Data of Nigeria LGA

Usage of the code chunk below :

st_read( ) - sf - to read simple features.

select( ) - dplyr - to select “shapeName” variable.

bdy_nga <- st_read(dsn = "/jephOstan/ISSS624/class_project/project_2/data/geospatial",
               layer = "geoBoundaries-NGA-ADM2",
               crs = 4326) %>%
  select(shapeName)
Reading layer `geoBoundaries-NGA-ADM2' from data source 
  `D:\jephOstan\ISSS624\class_project\project_2\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 774 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2.668534 ymin: 4.273007 xmax: 14.67882 ymax: 13.89442
Geodetic CRS:  WGS 84
problems(bdy_nga)

3.3.1 Review Imported File

3.3.1.1 check for missing data

Usage of the code chunk below :

freq.na( ) - questionr - to generate frequency table of missing value.

freq.na(bdy_nga$shapeName)
missing       % 
      0       0 

3.3.1.2 check for duplication :: “shapeName”

Usage of the code chunk below :

duplicated( ) - base - to determine duplicate elements.

freq(duplicated(bdy_nga$shapeName))
        n    % val%
FALSE 768 99.2 99.2
TRUE    6  0.8  0.8

3.3.1.3 list the duplicated value :: “shapeName”

Usage of the code chunk below :

add_count( ) - dplyr - to count observations by group

filter( ) - dplyr - to retain shapeName that has count not equal to 1.

wp_duplShapeName <- bdy_nga %>%
  add_count(bdy_nga$shapeName) %>%
  filter(n!=1) %>%
  select(-n)

wp_duplShapeName
Simple feature collection with 12 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 3.316459 ymin: 6.459038 xmax: 9.020704 ymax: 12.05035
Geodetic CRS:  WGS 84
First 10 features:
   shapeName bdy_nga$shapeName                       geometry
1      Bassa             Bassa MULTIPOLYGON (((6.708541 7....
2      Bassa             Bassa MULTIPOLYGON (((8.823522 10...
3   Ifelodun          Ifelodun MULTIPOLYGON (((4.664107 8....
4   Ifelodun          Ifelodun MULTIPOLYGON (((4.721977 7....
5   Irepodun          Irepodun MULTIPOLYGON (((5.05493 8.0...
6   Irepodun          Irepodun MULTIPOLYGON (((4.543349 7....
7   Nasarawa          Nasarawa MULTIPOLYGON (((8.554589 11...
8   Nasarawa          Nasarawa MULTIPOLYGON (((7.493228 8....
9        Obi               Obi MULTIPOLYGON (((8.191919 6....
10       Obi               Obi MULTIPOLYGON (((9.008576 8....

3.3.1.4 verify findings in section 3.3.1.3

Usage of the code chunk below :

tmap_mode( ) - tmap - to set tmap mode to static plotting or interactive.

tm_shape( ) - tmap - to specify the shape object.

tm_polygons( ) - tmap - to fill the polygons and draw the polygon borders.

tm_view( ) - tmap - to set the options for the interactive tmap viewer.

tm_fill( ) - tmap - to specify either which colour to be used or which data variable mapped to the colour palette.

tm_borders( ) - tmap - to draw the polygon borders.

tmap_style( ) - tmap - to set the tmap style.

tm_layout( ) - tmap - to set the layout of cartographic map.

tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(bdy_nga)+
  tm_polygons()+
  tm_view(set.zoom.limits = c(6,8))+

tm_shape(wp_duplShapeName)+
  tm_fill("shapeName",
          n = 6,
          style = "jenks")+
  tm_borders(alpha = 0.5)+
  tmap_style("albatross")+
  tm_layout(main.title = "Distribution of Duplicated ShapeName",
            main.title.size = 1.3,
            main.title.position = "center")
tmap style set to "albatross"
other available styles are: "white", "gray", "natural", "cobalt", "col_blind", "beaver", "bw", "classic", "watercolor" 

Remarks :

The plot above indicates those duplicated water points are from different Nigeria states.

tmap_mode("plot")
tmap mode set to plotting

3.3.1.5 acquire State info for duplicated value

The State info to be combined with the duplicated “shapeName”. This will make all the shapeName unique.

lga row_id headquarter state iso3166code state_dd_coordinates
Bassa 94 Oguma Kogi NG.KO.BA 7.75 6.75
Bassa 95 Bassa Plateau NG.PL.BA 9.16667 9.75
Ifelodun 304 Share Kwara NG.KW.IF 8.5 5.0
Ifelodun 305 Ikirun Osun NG.OS.ID 7.5 4.5
Irepodun 355 Omu Aran Kwara NG.KW.IR 8.5 5.0
Irepodun 356 Ilobu Osun NG.OS.IP 7.5 4.5
Nasarawa 519 Bompai Kano NG.KN.NA 11.5 8.5
Nasarawa 520 Nasarawa Nasarawa NG.NA.NA 8.53 7.7
Obi 546 Obi Nasarawa NG.NA.OB 8.53 7.7
Obi 547 Obarike-Ito Benue NG.BE.OB 7.33333 8.75
Surelere 693 Surelere Lagos NG.LA.SU 6.5 3.35
Surelere 694 Iresa-Adu Oyo NG.OY.SU 8.07 4.41


3.4 Data Wrangling

3.4.1 Edit Duplicated Value :: “shapeName”

bdy_nga$shapeName[c(94,95,304,305,355,356,519,546,547,693,694)] <- 
  c("Bassa Kogi",
    "Bassa Plateau",
    "Ifelodun Kwara",
    "Ifelodun Osun",
    "Irepodun Kwara",
    "Irepodun Osun",
    "Nasarawa Kano",
    "Nasarawa Nasarawa",
    "Obi Nasarawa",
    "Obi Benue",
    "Surulere Lagos",
    "Surulere Oyo")
Warning in bdy_nga$shapeName[c(94, 95, 304, 305, 355, 356, 519, 546, 547, :
number of items to replace is not a multiple of replacement length

3.4.1.1 validate edited value :: “shapeName”

wp_duplShapeName1 <- bdy_nga %>%
  add_count(bdy_nga$shapeName) %>%
  filter(n!=1) %>%
  select(-n)

wp_duplShapeName1
Simple feature collection with 0 features and 2 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
[1] shapeName         bdy_nga$shapeName geometry         
<0 rows> (or 0-length row.names)

3.4.2 Perform Point-in-Polygon Overlay

This step combine both attribute and boundary of the water points into a simple feature object.

3.4.2.1 join Objects :: wp_sf and bdy_nga

Usage of the code chunk below :

st_join( ) - sf - to join sf-class objects based on geometry, namely, wp_sf and bdy_nga.

wp_joined <- st_join(wp_sf, bdy_nga)

3.4.2.2 save and read RDS File :: wp_joined

write_rds(wp_joined,"/jephOstan/ISSS624/class_project/project_2/data/geodata/wp_joined.rds",compress = "xz")

wp_joined <- read_rds("/jephOstan/ISSS624/class_project/project_2/data/geodata/wp_joined.rds")

3.4.2.3 inspect joined file :: wp_joined

st_geometry(wp_joined)
Geometry set for 95008 features 
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 2.707441 ymin: 4.301812 xmax: 14.21828 ymax: 13.86568
Geodetic CRS:  WGS 84
First 5 geometries:
POINT (6.95009 6.78599)
POINT (7.604793 6.78321)
POINT (7.60024 6.759284)
POINT (7.615451 6.799595)
POINT (7.65991 6.762375)

-- determine reference point :: “shapeName” or “clean_adm2”

wp_refLga <- (wp_joined$shapeName == wp_joined$clean_adm2)
freq(wp_refLga)
          n    % val%
FALSE 29713 31.3 31.3
TRUE  65266 68.7 68.7
NA       29  0.0   NA

Remarks :

There are 29,713 of “FALSE”, which is more than 30% of local government areas’ name mismatched between “shapeName” and “clean_adm2”.

  • Unlike the Water Point Data Exchange data that involved multitple parties, the geoBoundaries data is sourced from “geoBoundaries: A global database of political administrative boundaries.” Plos one 15, no. 4 (2020): e0231866,

    • Hence, the “shapeName” to be used throughout this study.

-- assess uniqueness of each Water Point

wp_joined %>% janitor::get_dupes(shapeName,lat_lon_deg)
No duplicate combinations found of: shapeName, lat_lon_deg
Simple feature collection with 0 features and 24 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
# A tibble: 0 × 25
# … with 25 variables: shapeName <chr>, lat_lon_deg <chr>, dupe_count <int>,
#   row_id <dbl>, lat_deg <dbl>, lon_deg <dbl>, New Georeferenced Column <chr>,
#   water_source <chr>, water_source_clean <chr>, water_source_category <chr>,
#   water_tech_clean <chr>, water_tech_category <chr>, status_clean <chr>,
#   status <chr>, clean_adm1 <chr>, clean_adm2 <chr>,
#   water_point_population <dbl>, local_population_1km <dbl>,
#   crucialness_score <dbl>, pressure_score <dbl>, usage_capacity <dbl>, …

Remarks :

Each water point observation is unique as there are no duplicate combination of “shapeName” together with “lat_lon_deg”.

-- reveal value :: “status_clean”

freq(wp_joined$status_clean)
                                     n    % val%
Abandoned                          175  0.2  0.2
Abandoned/Decommissioned           234  0.2  0.3
Functional                       45883 48.3 54.4
Functional but needs repair       4579  4.8  5.4
Functional but not in use         1686  1.8  2.0
Non-Functional                   29385 30.9 34.8
Non-Functional due to dry season  2403  2.5  2.8
Non functional due to dry season     7  0.0  0.0
NA                               10656 11.2   NA

-- reveal value :: “crucialness_score”

summary(wp_joined$crucialness_score)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   0.130   0.304   0.414   0.628   1.000    6879 

-- reveal value :: “is_urban”

freq(wp_joined$is_urban)
          n    % val%
FALSE 75444 79.4 79.4
TRUE  19564 20.6 20.6

-- reveal value :: “water_tech_category”

freq(wp_joined$water_tech_category)
                    n    % val%
Hand Pump       58755 61.8 69.2
Mechanized Pump 25644 27.0 30.2
Rope and Bucket     1  0.0  0.0
Tapstand          553  0.6  0.7
NA              10055 10.6   NA

-- reveal value :: “usage_capacity”

freq(wp_joined$usage_capacity)
         n    % val%
50       2  0.0  0.0
250    573  0.6  0.6
300  68789 72.4 72.4
1000 25644 27.0 27.0

3.4.3 Replace “NA” with “Unknown”

mutate( ) - dplyr - to run replace_na( ) function.

  • replace_na( ) - tidyr - to replace NAs with “unknown”.
wp_joined1 <- wp_joined %>%
  mutate(status_clean = replace_na(status_clean, "Unknown")) %>%
  mutate(water_tech_category = replace_na(water_tech_category, "Unknown")) %>%
  mutate(status = replace_na(status, "Unknown")) %>%
  mutate(water_point_population = replace_na(water_point_population, 0)) %>%
  mutate(local_population_1km = replace_na(local_population_1km, 0)) %>%
  mutate(crucialness_score = replace_na(crucialness_score, 0)) %>%
  mutate(pressure_score = replace_na(pressure_score, 0))

3.4.4 Standardise Value

3.4.4.1 combine value :: “status_clean”

wp_joined1 <- wp_joined1 %>%
  mutate(status_clean = str_replace(status_clean,"Non functional due to dry season"  ,"Non-Functional due to dry season")) %>%
  mutate(status_clean = str_replace(status_clean,"Abandoned/Decommissioned/Decommissioned","Abandoned/Decommissioned"))

3.4.4.2 review “status_clean”

freq(wp_joined1$status_clean)

3.4.4.3 read RDS file :: wp_joined1

wp_joined1 <- read_rds("/jephOstan/ISSS624/class_project/project_2/data/geodata/wp_joined1.rds")

3.4.5 Extract Water Point for New Table :: wp_nga

3.4.5.1 extract functional water point

wpt_functional <- wp_joined1 %>%
  filter(status_clean %in%
           c("Functional", 
             "Functional but not in use",
             "Functional but needs repair"))

-- save and read RDS file :: wpt_functional

write_rds(wpt_functional,"/jephOstan/ISSS624/class_project/project_2/data/geodata/wpt_functional.rds",compress = "xz")

wpt_functional <- read_rds("/jephOstan/ISSS624/class_project/project_2/data/geodata/wpt_functional.rds")

3.4.5.2 inspect variable and value

-- reveal value :: “status_clean”

freq(wpt_functional$status_clean)
                                n    % val%
Functional                  45883 88.0 88.0
Functional but needs repair  4579  8.8  8.8
Functional but not in use    1686  3.2  3.2
length(wpt_functional$row_id)
[1] 52148
length(wpt_functional$row_id)/length(wp_joined1$row_id)*100
[1] 54.88801

Remarks :

The total functional water points is 52,148 which is about 54.89% of total water points.

-- reveal value :: “usage_capacity”

freq(wpt_functional$usage_capacity)
         n    % val%
50       2  0.0  0.0
250     75  0.1  0.1
300  38064 73.0 73.0
1000 14007 26.9 26.9

-- reveal value “usage_capacity” by “status_clean”

wpt_functional %>% count(status_clean, usage_capacity, sort = TRUE)
Simple feature collection with 10 features and 3 fields
Geometry type: MULTIPOINT
Dimension:     XY
Bounding box:  xmin: 2.711632 ymin: 4.302938 xmax: 13.5022 ymax: 13.86331
Geodetic CRS:  WGS 84
# A tibble: 10 × 4
   status_clean                usage_capacity     n                     geometry
 * <chr>                                <dbl> <int>             <MULTIPOINT [°]>
 1 Functional                             300 33687 ((3.064921 7.994882), (3.06…
 2 Functional                            1000 12124 ((3.080189 7.99252), (3.085…
 3 Functional but needs repair            300  3306 ((3.340832 8.037962), (3.34…
 4 Functional but needs repair           1000  1271 ((3.373801 7.992051), (3.33…
 5 Functional but not in use              300  1071 ((3.046639 8.017765), (2.88…
 6 Functional but not in use             1000   612 ((3.088655 8.005296), (3.05…
 7 Functional                             250    70 ((3.355785 6.498105), (3.67…
 8 Functional but not in use              250     3 ((8.032945 6.878883), (7.00…
 9 Functional                              50     2 ((7.027967 4.765731), (8.92…
10 Functional but needs repair            250     2 ((6.465915 5.826699), (7.93…

-- reveal value :: “crucialness_score”

summary(wpt_functional$crucialness_score == 1)
   Mode   FALSE    TRUE 
logical   47006    5142 

-- determine the total population within 1 km by “crucialness_score”

freq(wpt_functional$crucialness_score == 1)
          n    % val%
FALSE 47006 90.1 90.1
TRUE   5142  9.9  9.9
sum(wpt_functional[wpt_functional$crucialness_score == 1,]$local_population_1km)
[1] 11252574

Remarks :

Given the “crucialness_score” is a ratio of current water point users to the total population within 1 km radius thereof :

  • Currently, 5,142 water points serve the population within a 1 km radius at its capacity limit.

    • The usage capacity may need to be increased to sustain or improve the growth or development rate within 1km of these water points.

    • Should the population within 1 km therefrom grow above 11,252,574, there may be multiple repercussions in resources management, urbanisation progress, local food and beverage consumption, local commodity prices, or worst case scenario would be the stability of civil society.

summary(wpt_functional$pressure_score > 1)
   Mode   FALSE    TRUE 
logical   27469   24679 
length(wpt_functional$pressure_score)
[1] 52148
24679/52148*100 #percentage of functional waterpoints over their usage limit
[1] 47.32492

Remarks :

Given the “pressure_score” is the ratio of the current water point users to the usage capacity thereof :

  • 24,679, or 47.32% of functional water points, are currently over their limit of usage capacity.

3.4.5.3 Exploratory Data Analysis (EDA) :: wpt_functional

-- plot “status_clean”

ggplot(data = wpt_functional,
       aes(fct_infreq(status_clean), fill=status_clean))+ 
  geom_bar()+
  geom_text(
     aes(label=after_stat(count)),
     stat='count',
     nudge_x=-0.25,
     vjust=-0.2)+
  geom_text(
     aes(label= scales::percent(signif(after_stat(count/sum(count))))),
     stat='count',
     nudge_x=0.25,
     vjust=-0.2)+
  scale_x_discrete(guide = guide_axis(n.dodge = 2))+
  guides(fill=guide_legend (title = 'Status'))

-- plot “water_tech_category”

ggplot(data=wpt_functional, 
       aes(x=fct_infreq(
         water_tech_category)))+
  geom_bar(aes(
    fill = water_tech_category), 
    width = 0.8)+
  geom_text(aes(
    label = ..count..),
    stat = "count", 
    vjust=-0.2, 
    size = 3.5, 
    color = "black")+
  scale_x_discrete(guide = guide_axis(n.dodge = 2))+
  guides(fill=guide_legend (title = 'Water Tech Deployed'))
Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(count)` instead.

-- plot “water_source_clean”

ggplot(data=wpt_functional, 
       aes(x=fct_infreq(
         water_source_clean)))+
  geom_bar(aes(
    fill = water_source_clean), 
    width = 0.8)+
  geom_text(aes(
    label = ..count..),
    stat = "count", 
    vjust=-0.2, 
    size = 3.5, 
    color = "black")+
  scale_x_discrete(guide = guide_axis(
    n.dodge = 2))+
  guides(fill=guide_legend (
    title = 'Source of Water Supply'))

3.4.5.4 add attribute to new data table

wp_nga <- bdy_nga %>%
  mutate(`total_wp` = lengths(
    st_intersects(bdy_nga, wp_joined1))) %>%
  
  mutate(`wp_functional` = lengths(
    st_intersects(bdy_nga, wpt_functional))) %>%
  
  mutate(`pct_functional` = (`wp_functional`/`total_wp`*100))

-- replace “NaN” with 0

wp_nga <- wp_nga %>%
  mutate(`pct_functional` = replace_na(pct_functional, 0))

summary(wp_nga)
  shapeName                  geometry      total_wp     wp_functional   
 Length:774         MULTIPOLYGON :774   Min.   :  0.0   Min.   :  0.00  
 Class :character   epsg:4326    :  0   1st Qu.: 45.0   1st Qu.: 17.00  
 Mode  :character   +proj=long...:  0   Median : 96.0   Median : 45.50  
                                        Mean   :122.7   Mean   : 67.36  
                                        3rd Qu.:168.8   3rd Qu.: 87.75  
                                        Max.   :894.0   Max.   :752.00  
 pct_functional  
 Min.   :  0.00  
 1st Qu.: 32.61  
 Median : 47.41  
 Mean   : 49.84  
 3rd Qu.: 66.99  
 Max.   :100.00  

3.4.5.5 extract non-functional water point

wpt_nonFunctional <- wp_joined1 %>%
  filter(status_clean %in%
           c("Abandoned/Decommissioned", 
             "Non-Functional",
             "Non-Functional due to dry season"))

-- save and read RDS file :: wpt_nonFuntional

write_rds(wpt_nonFunctional,"/jephOstan/ISSS624/class_project/project_2/data/geodata/wpt_nonFunctional.rds",compress = "xz")

wpt_nonFunctional <- read_rds("/jephOstan/ISSS624/class_project/project_2/data/geodata/wpt_nonFunctional.rds")

3.4.5.6 inspect variable and value

-- reveal value :: “status_clean”

freq(wpt_nonFunctional$status_clean)
                                     n    % val%
Abandoned/Decommissioned           234  0.7  0.7
Non-Functional                   29385 91.7 91.7
Non-Functional due to dry season  2410  7.5  7.5
length(wpt_nonFunctional$row_id)
[1] 32029
length(wpt_nonFunctional$row_id)/length(wp_joined1$row_id)*100
[1] 33.7119

Remarks :

There are 32,204, which is about 33.9% out of total water points.

-- reveal value :: “usage_capacity”

freq(wpt_nonFunctional$usage_capacity)
         n    % val%
250     41  0.1  0.1
300  20586 64.3 64.3
1000 11402 35.6 35.6

-- reveal value “usage_capacity” by “status_clean”

wpt_nonFunctional %>% count(status_clean, usage_capacity, sort = TRUE)
Simple feature collection with 7 features and 3 fields
Geometry type: MULTIPOINT
Dimension:     XY
Bounding box:  xmin: 2.707441 ymin: 4.301812 xmax: 13.4192 ymax: 13.86567
Geodetic CRS:  WGS 84
# A tibble: 7 × 4
  status_clean                     usage_capac…¹     n                  geometry
* <chr>                                    <dbl> <int>          <MULTIPOINT [°]>
1 Non-Functional                             300 18492 ((3.064526 7.994448), (3…
2 Non-Functional                            1000 10852 ((3.083391 7.993231), (3…
3 Non-Functional due to dry season           300  2012 ((3.051752 7.984243), (3…
4 Non-Functional due to dry season          1000   398 ((3.056661 7.985808), (3…
5 Abandoned/Decommissioned                  1000   152 ((4.713438 7.891137), (4…
6 Abandoned/Decommissioned                   300    82 ((3.199483 8.912549), (2…
7 Non-Functional                             250    41 ((3.976195 6.582998), (3…
# … with abbreviated variable name ¹​usage_capacity

-- reveal “crucialness_score”

sum(wpt_nonFunctional$local_population_1km)
[1] 93999535
sum(wpt_nonFunctional$water_point_population)
[1] 46255888

Remarks :

Given the “crucialness_score” is a ratio of current water point users to the total population within a 1 km radius thereof , in the context of non-functional :

  • Currently, out of 95,013,340 residents within a 1 km radius, there are 46,710,127 of them is affected by these non-functional water point.

3.4.5.7 EDA :: wpt_nonFunctional

-- plot “status_clean”

ggplot(data = wpt_nonFunctional,
       aes(fct_infreq(status_clean), 
           fill=status_clean))+ 
  geom_bar()+
  geom_text(
     aes(label=after_stat(count)),
     stat='count',
     nudge_x=-0.25,
     vjust=-0.2)+
  geom_text(
     aes(label= scales::percent(
       signif(
         after_stat(
           count/sum(count)
           )))),
     stat='count',
     nudge_x=0.25,
     vjust=-0.2)+
  scale_x_discrete(
    guide = guide_axis(
      n.dodge = 2))+
  guides(fill=guide_legend (
    title = 'Status'))

-- plot “water_tech_category”

ggplot(data=wpt_nonFunctional, 
       aes(fct_infreq(
         water_tech_category)))+
  geom_bar(aes(
    fill = water_tech_category), 
    width = 0.8)+
  geom_text(aes(
    label = ..count..),
    stat = "count",
    vjust=-0.2,
    size = 3.5,
    color = "black")+
  scale_x_discrete(guide = guide_axis(
    n.dodge = 2))+
  guides(fill=guide_legend (
    title = 'Water Tech Deployed'))

-- plot “water_source_clean”

ggplot(data=wpt_nonFunctional, 
       aes(fct_infreq(
         water_source_clean)))+
  geom_bar(aes(
    fill = water_source_clean),
    width = 0.8)+
  geom_text(aes(
    label = ..count..),
    stat = "count",
    vjust=-0.2,
    size = 3.5,
    color = "black")+
  scale_x_discrete(guide = guide_axis(
    n.dodge = 2))+
  guides(fill=guide_legend (
    title = 'Source of Water Supply'))

3.4.5.8 add wpt_nonFunctional to wp_nga

wp_nga <- wp_nga %>%
  mutate(`wp_nonFunctional` = lengths(
    st_intersects(bdy_nga, wpt_nonFunctional))) %>%
  mutate(`pct_nonFunctional` = (`wp_nonFunctional`/`total_wp`*100))

-- replace “NaN” with 0

wp_nga <- wp_nga %>%
  mutate(`pct_nonFunctional` = replace_na(pct_nonFunctional, 0))

summary(wp_nga)
  shapeName                  geometry      total_wp     wp_functional   
 Length:774         MULTIPOLYGON :774   Min.   :  0.0   Min.   :  0.00  
 Class :character   epsg:4326    :  0   1st Qu.: 45.0   1st Qu.: 17.00  
 Mode  :character   +proj=long...:  0   Median : 96.0   Median : 45.50  
                                        Mean   :122.7   Mean   : 67.36  
                                        3rd Qu.:168.8   3rd Qu.: 87.75  
                                        Max.   :894.0   Max.   :752.00  
 pct_functional   wp_nonFunctional pct_nonFunctional
 Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   
 1st Qu.: 32.61   1st Qu.: 12.00   1st Qu.: 20.77   
 Median : 47.41   Median : 33.50   Median : 34.89   
 Mean   : 49.84   Mean   : 41.37   Mean   : 35.58   
 3rd Qu.: 66.99   3rd Qu.: 60.00   3rd Qu.: 50.00   
 Max.   :100.00   Max.   :278.00   Max.   :100.00   

3.4.5.9 extract unknown water point

wpt_unknown <- wp_joined1 %>%
  filter(status_clean == "Unknown")

-- save and read RDS file :: wpt_unknown

write_rds(wpt_unknown,"/jephOstan/ISSS624/class_project/project_2/data/geodata/wpt_unknown.rds")

wpt_unknown <- read_rds("/jephOstan/ISSS624/class_project/project_2/data/geodata/wpt_unknown.rds")

3.4.5.10 inspect variable and value

-- reveal value :: “status_clean”

length(wpt_unknown$row_id)
[1] 10656
length(wpt_unknown$row_id)/length(wp_joined1$row_id)*100
[1] 11.2159

Remarks :

There are 10,656 water points with unknown status, about 11.22% of total water points.

-- determine affected population

sum(wpt_unknown$water_point_population)
[1] 18831488
sum(wpt_unknown$local_population_1km)
[1] 31418651

3.4.5.11 add wpt_unknown to wp_nga

wp_nga <- wp_nga %>%
  mutate(`wp_unknown` = lengths(
    st_intersects(bdy_nga, wpt_unknown))) %>%
  mutate(`pct_unknown` = (`wp_unknown`/`total_wp`*100))

-- replace “NaN” with 0

wp_nga <- wp_nga %>%
  mutate(`pct_unknown` = replace_na(pct_unknown, 0))

summary(wp_nga)
  shapeName                  geometry      total_wp     wp_functional   
 Length:774         MULTIPOLYGON :774   Min.   :  0.0   Min.   :  0.00  
 Class :character   epsg:4326    :  0   1st Qu.: 45.0   1st Qu.: 17.00  
 Mode  :character   +proj=long...:  0   Median : 96.0   Median : 45.50  
                                        Mean   :122.7   Mean   : 67.36  
                                        3rd Qu.:168.8   3rd Qu.: 87.75  
                                        Max.   :894.0   Max.   :752.00  
 pct_functional   wp_nonFunctional pct_nonFunctional   wp_unknown    
 Min.   :  0.00   Min.   :  0.00   Min.   :  0.00    Min.   :  0.00  
 1st Qu.: 32.61   1st Qu.: 12.00   1st Qu.: 20.77    1st Qu.:  0.00  
 Median : 47.41   Median : 33.50   Median : 34.89    Median :  0.00  
 Mean   : 49.84   Mean   : 41.37   Mean   : 35.58    Mean   : 13.76  
 3rd Qu.: 66.99   3rd Qu.: 60.00   3rd Qu.: 50.00    3rd Qu.: 17.75  
 Max.   :100.00   Max.   :278.00   Max.   :100.00    Max.   :219.00  
  pct_unknown    
 Min.   :  0.00  
 1st Qu.:  0.00  
 Median :  0.00  
 Mean   : 12.55  
 3rd Qu.: 20.83  
 Max.   :100.00  

3.4.5.12 visualise distribution :: “status_clean”

Usage of the code chunk below :

qtm( ) - tmap - to plot a thematic map quickly.

tmap_arrange( ) - tmap - to arrange small multiples in grid layout.

total_wp <- qtm(wp_nga,"total_wp")+
  tm_layout(legend.height = 0.3, legend.width = 0.5)

wp_functional <- qtm(wp_nga,"wp_functional")+
  tm_layout(legend.height = 0.3, legend.width = 0.5)

wp_nonFunctional <- qtm(wp_nga,"wp_nonFunctional")+
  tm_layout(legend.height = 0.3, legend.width = 0.5)

wp_unknown <- qtm(wp_nga,"wp_unknown")+
  tm_layout(legend.height = 0.3, legend.width = 0.5)

tmap_arrange(total_wp, wp_functional, wp_nonFunctional, wp_unknown, asp=0, ncol = 2, nrow = 2, widths = 5, heights = 10, sync = TRUE)

3.4.5.13 extract “water_tech_category” to wp_nga

freq(wp_joined1$water_tech_category, sort = "dec")
                    n    % val%
Hand Pump       58755 61.8 61.8
Mechanized Pump 25644 27.0 27.0
Unknown         10055 10.6 10.6
Tapstand          553  0.6  0.6
Rope and Bucket     1  0.0  0.0

Remarks :

Only “Hand Pump”, “Mechanized Pump”, and “Tapstand” are to be extracted for further analysis as the rest are either less than 0.5% or “Unknown”.

wtc_hPump <- wp_joined1 %>%
  filter(water_tech_category %in%
           "Hand Pump")

wtc_mPump <- wp_joined1 %>%
  filter(water_tech_category %in%
           "Mechanized Pump")

wtc_tStand <- wp_joined1 %>%
  filter(water_tech_category %in%
           "Tapstand")

wp_nga <- wp_nga %>%
  mutate(`total_handPump` = lengths(
    st_intersects(bdy_nga, wtc_hPump)
  )) %>%
  mutate(`total_mechPump` = lengths(
    st_intersects(bdy_nga, wtc_mPump)
  )) %>%
    mutate(`total_tapStand` = lengths(
    st_intersects(bdy_nga, wtc_tStand)
  )) %>%
  mutate(`pct_handPump` = (`total_handPump`/`total_wp`*100)) %>%
  mutate(`pct_mechPump` = (`total_mechPump`/`total_wp`*100)) %>%
  mutate(`pct_tapStand` = (`total_tapStand`/`total_wp`*100))

-- replace “NaN” with 0

wp_nga <- wp_nga %>%
  mutate(`pct_handPump` = replace_na(pct_handPump, 0)) %>%
  mutate(`pct_mechPump` = replace_na(pct_mechPump, 0)) %>%
  mutate(`pct_tapStand` = replace_na(pct_tapStand, 0))

summary(wp_nga)
  shapeName                  geometry      total_wp     wp_functional   
 Length:774         MULTIPOLYGON :774   Min.   :  0.0   Min.   :  0.00  
 Class :character   epsg:4326    :  0   1st Qu.: 45.0   1st Qu.: 17.00  
 Mode  :character   +proj=long...:  0   Median : 96.0   Median : 45.50  
                                        Mean   :122.7   Mean   : 67.36  
                                        3rd Qu.:168.8   3rd Qu.: 87.75  
                                        Max.   :894.0   Max.   :752.00  
 pct_functional   wp_nonFunctional pct_nonFunctional   wp_unknown    
 Min.   :  0.00   Min.   :  0.00   Min.   :  0.00    Min.   :  0.00  
 1st Qu.: 32.61   1st Qu.: 12.00   1st Qu.: 20.77    1st Qu.:  0.00  
 Median : 47.41   Median : 33.50   Median : 34.89    Median :  0.00  
 Mean   : 49.84   Mean   : 41.37   Mean   : 35.58    Mean   : 13.76  
 3rd Qu.: 66.99   3rd Qu.: 60.00   3rd Qu.: 50.00    3rd Qu.: 17.75  
 Max.   :100.00   Max.   :278.00   Max.   :100.00    Max.   :219.00  
  pct_unknown     total_handPump   total_mechPump   total_tapStand   
 Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.0000  
 1st Qu.:  0.00   1st Qu.:  6.00   1st Qu.: 11.00   1st Qu.: 0.0000  
 Median :  0.00   Median : 47.00   Median : 25.50   Median : 0.0000  
 Mean   : 12.55   Mean   : 75.89   Mean   : 33.12   Mean   : 0.7145  
 3rd Qu.: 20.83   3rd Qu.:111.00   3rd Qu.: 46.00   3rd Qu.: 0.0000  
 Max.   :100.00   Max.   :764.00   Max.   :245.00   Max.   :42.0000  
  pct_handPump     pct_mechPump     pct_tapStand    
 Min.   :  0.00   Min.   :  0.00   Min.   : 0.0000  
 1st Qu.: 16.70   1st Qu.: 12.20   1st Qu.: 0.0000  
 Median : 50.99   Median : 31.27   Median : 0.0000  
 Mean   : 48.73   Mean   : 37.54   Mean   : 0.5794  
 3rd Qu.: 77.78   3rd Qu.: 57.71   3rd Qu.: 0.0000  
 Max.   :100.00   Max.   :100.00   Max.   :32.8947  

3.4.5.14 visualise wp_nga distribution :: “water_tech_category”

tmap_mode("view")
tmap mode set to interactive viewing
handPump <- tm_shape(bdy_nga)+
  tm_polygons(alpha = 0.1) +
tm_shape(wp_nga) +  
  tm_dots(col = "pct_handPump",
          border.col = "gray60",
          border.lwd = 0.5) +
  tm_view(set.zoom.limits = c(5,9))

mechPump <- tm_shape(bdy_nga)+
  tm_polygons(alpha = 0.1) +
tm_shape(wp_nga) +  
  tm_dots(col = "pct_mechPump",
          border.col = "gray60",
          border.lwd = 0.5) +
  tm_view(set.zoom.limits = c(5,9))

tapStand <- tm_shape(bdy_nga)+
  tm_polygons(alpha = 0.1) +
tm_shape(wp_nga) +  
  tm_dots(col = "pct_tapStand",
          border.col = "gray60",
          border.lwd = 0.5) +
  tm_view(set.zoom.limits = c(5,9))

tmap_arrange(handPump, mechPump, tapStand,
             asp=1, 
             ncol=2,
             sync = TRUE)
tmap_mode("plot")
tmap mode set to plotting

3.4.5.15 extract “usage_capacity” to wp_nga

freq(wp_joined1$usage_capacity, sort = "dec")
         n    % val%
300  68789 72.4 72.4
1000 25644 27.0 27.0
250    573  0.6  0.6
50       2  0.0  0.0

Remarks :

  • Only “300”, “1000”, and “250” are to be extracted for further analysis as the rest are either less than 0.5% or “Unknown”.

  • But, “50” will be included in the new variable “total_ucN1000” as part of the none ‘1000’ “usage_capacity” value.

uCap_300 <- wp_joined1 %>%
  filter(usage_capacity %in%
           "300")

uCap_1000 <- wp_joined1 %>%
  filter(usage_capacity %in%
           "1000")

uCap_250 <- wp_joined1 %>%
  filter(usage_capacity %in%
           "250")

uCap_50 <- wp_joined1 %>%
  filter(usage_capacity %in%
           "50")

wp_nga <- wp_nga %>%
  mutate(`total_uc300` = lengths(
    st_intersects(bdy_nga, uCap_300)
  )) %>%
  mutate(`total_uc1000` = lengths(
    st_intersects(bdy_nga, uCap_1000)
  )) %>%
  mutate(`total_uc250` = lengths(
    st_intersects(bdy_nga, uCap_250)
  )) %>%
  mutate(`total_uc50` = lengths(
    st_intersects(bdy_nga, uCap_50)
  )) %>%
  mutate(`total_ucN1000` = ((lengths(
    st_intersects(
      bdy_nga, uCap_300))) + (lengths(
        st_intersects(
          bdy_nga, uCap_250))) + (lengths(
            st_intersects(
              bdy_nga, uCap_50))))
    )%>%
           
  mutate(`pct_ucN1000` = (`total_ucN1000`/`total_wp`*100)) %>%
  mutate(`pct_uc300` = (`total_uc300`/`total_wp`*100)) %>%
  mutate(`pct_uc1000` = (`total_uc1000`/`total_wp`*100)) %>%
  mutate(`pct_uc250` = (`total_uc250`/`total_wp`*100))

-- replace “NaN” with 0

wp_nga <- wp_nga %>%
  mutate(`pct_ucN1000` = replace_na(pct_ucN1000, 0)) %>%
  mutate(`pct_uc300` = replace_na(pct_uc300, 0)) %>%
  mutate(`pct_uc1000` = replace_na(pct_uc1000, 0)) %>%
  mutate(`pct_uc250` = replace_na(pct_uc250, 0))

summary(wp_nga)
  shapeName                  geometry      total_wp     wp_functional   
 Length:774         MULTIPOLYGON :774   Min.   :  0.0   Min.   :  0.00  
 Class :character   epsg:4326    :  0   1st Qu.: 45.0   1st Qu.: 17.00  
 Mode  :character   +proj=long...:  0   Median : 96.0   Median : 45.50  
                                        Mean   :122.7   Mean   : 67.36  
                                        3rd Qu.:168.8   3rd Qu.: 87.75  
                                        Max.   :894.0   Max.   :752.00  
 pct_functional   wp_nonFunctional pct_nonFunctional   wp_unknown    
 Min.   :  0.00   Min.   :  0.00   Min.   :  0.00    Min.   :  0.00  
 1st Qu.: 32.61   1st Qu.: 12.00   1st Qu.: 20.77    1st Qu.:  0.00  
 Median : 47.41   Median : 33.50   Median : 34.89    Median :  0.00  
 Mean   : 49.84   Mean   : 41.37   Mean   : 35.58    Mean   : 13.76  
 3rd Qu.: 66.99   3rd Qu.: 60.00   3rd Qu.: 50.00    3rd Qu.: 17.75  
 Max.   :100.00   Max.   :278.00   Max.   :100.00    Max.   :219.00  
  pct_unknown     total_handPump   total_mechPump   total_tapStand   
 Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.0000  
 1st Qu.:  0.00   1st Qu.:  6.00   1st Qu.: 11.00   1st Qu.: 0.0000  
 Median :  0.00   Median : 47.00   Median : 25.50   Median : 0.0000  
 Mean   : 12.55   Mean   : 75.89   Mean   : 33.12   Mean   : 0.7145  
 3rd Qu.: 20.83   3rd Qu.:111.00   3rd Qu.: 46.00   3rd Qu.: 0.0000  
 Max.   :100.00   Max.   :764.00   Max.   :245.00   Max.   :42.0000  
  pct_handPump     pct_mechPump     pct_tapStand      total_uc300    
 Min.   :  0.00   Min.   :  0.00   Min.   : 0.0000   Min.   :  0.00  
 1st Qu.: 16.70   1st Qu.: 12.20   1st Qu.: 0.0000   1st Qu.: 15.25  
 Median : 50.99   Median : 31.27   Median : 0.0000   Median : 59.00  
 Mean   : 48.73   Mean   : 37.54   Mean   : 0.5794   Mean   : 88.85  
 3rd Qu.: 77.78   3rd Qu.: 57.71   3rd Qu.: 0.0000   3rd Qu.:126.75  
 Max.   :100.00   Max.   :100.00   Max.   :32.8947   Max.   :767.00  
  total_uc1000     total_uc250        total_uc50       total_ucN1000   
 Min.   :  0.00   Min.   : 0.0000   Min.   :0.000000   Min.   :  0.00  
 1st Qu.: 11.00   1st Qu.: 0.0000   1st Qu.:0.000000   1st Qu.: 16.00  
 Median : 25.50   Median : 0.0000   Median :0.000000   Median : 60.00  
 Mean   : 33.12   Mean   : 0.7403   Mean   :0.002584   Mean   : 89.59  
 3rd Qu.: 46.00   3rd Qu.: 0.0000   3rd Qu.:0.000000   3rd Qu.:127.75  
 Max.   :245.00   Max.   :42.0000   Max.   :1.000000   Max.   :767.00  
  pct_ucN1000       pct_uc300        pct_uc1000       pct_uc250      
 Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.0000  
 1st Qu.: 39.68   1st Qu.: 38.67   1st Qu.: 12.20   1st Qu.: 0.0000  
 Median : 67.03   Median : 65.91   Median : 31.27   Median : 0.0000  
 Mean   : 60.78   Mean   : 60.17   Mean   : 37.54   Mean   : 0.6114  
 3rd Qu.: 87.35   3rd Qu.: 87.02   3rd Qu.: 57.71   3rd Qu.: 0.0000  
 Max.   :100.00   Max.   :100.00   Max.   :100.00   Max.   :32.8947  

3.4.5.16 visualise wp_nga distribution :: “usage_capacity”

tmap_mode("view")
tmap mode set to interactive viewing
uc300 <- tm_shape(bdy_nga)+
  tm_polygons(alpha = 0.1) +
tm_shape(wp_nga) +  
  tm_dots(col = "pct_uc300",
          border.col = "gray60",
          border.lwd = 0.5) +
  tm_view(set.zoom.limits = c(5,9))

uc1000 <- tm_shape(bdy_nga)+
  tm_polygons(alpha = 0.1) +
tm_shape(wp_nga) +  
  tm_dots(col = "pct_uc1000",
          border.col = "gray60",
          border.lwd = 0.5) +
  tm_view(set.zoom.limits = c(5,9))

uc250 <- tm_shape(bdy_nga)+
  tm_polygons(alpha = 0.1) +
tm_shape(wp_nga) +  
  tm_dots(col = "pct_uc250",
          border.col = "gray60",
          border.lwd = 0.5) +
  tm_view(set.zoom.limits = c(5,9))

tmap_arrange(uc300, uc1000, uc250,
             asp=1, 
             ncol=2,
             sync = TRUE)
tmap_mode("plot")
tmap mode set to plotting

3.4.5.17 extract “is_urban” to wp_nga

urban_1 <- wp_joined1 %>%
  filter(is_urban %in%
           "TRUE")

urban_0 <- wp_joined1 %>%
  filter(is_urban %in%
           "FALSE")

wp_nga <- wp_nga %>%
  mutate(`total_urban1` = lengths(
    st_intersects(bdy_nga, urban_1)
  )) %>%
  mutate(`total_urban0` = lengths(
    st_intersects(bdy_nga, urban_0)
  )) %>%
  mutate(`pct_urban1` = (`total_urban1`/`total_wp`*100)) %>%
  mutate(`pct_urban0` = (`total_urban0`/`total_wp`*100))

-- replace “NaN” with 0

wp_nga <- wp_nga %>%
  mutate(`pct_urban1` = replace_na(pct_urban1, 0)) %>%
  mutate(`pct_urban0` = replace_na(pct_urban0, 0))

summary(wp_nga)
  shapeName                  geometry      total_wp     wp_functional   
 Length:774         MULTIPOLYGON :774   Min.   :  0.0   Min.   :  0.00  
 Class :character   epsg:4326    :  0   1st Qu.: 45.0   1st Qu.: 17.00  
 Mode  :character   +proj=long...:  0   Median : 96.0   Median : 45.50  
                                        Mean   :122.7   Mean   : 67.36  
                                        3rd Qu.:168.8   3rd Qu.: 87.75  
                                        Max.   :894.0   Max.   :752.00  
 pct_functional   wp_nonFunctional pct_nonFunctional   wp_unknown    
 Min.   :  0.00   Min.   :  0.00   Min.   :  0.00    Min.   :  0.00  
 1st Qu.: 32.61   1st Qu.: 12.00   1st Qu.: 20.77    1st Qu.:  0.00  
 Median : 47.41   Median : 33.50   Median : 34.89    Median :  0.00  
 Mean   : 49.84   Mean   : 41.37   Mean   : 35.58    Mean   : 13.76  
 3rd Qu.: 66.99   3rd Qu.: 60.00   3rd Qu.: 50.00    3rd Qu.: 17.75  
 Max.   :100.00   Max.   :278.00   Max.   :100.00    Max.   :219.00  
  pct_unknown     total_handPump   total_mechPump   total_tapStand   
 Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.0000  
 1st Qu.:  0.00   1st Qu.:  6.00   1st Qu.: 11.00   1st Qu.: 0.0000  
 Median :  0.00   Median : 47.00   Median : 25.50   Median : 0.0000  
 Mean   : 12.55   Mean   : 75.89   Mean   : 33.12   Mean   : 0.7145  
 3rd Qu.: 20.83   3rd Qu.:111.00   3rd Qu.: 46.00   3rd Qu.: 0.0000  
 Max.   :100.00   Max.   :764.00   Max.   :245.00   Max.   :42.0000  
  pct_handPump     pct_mechPump     pct_tapStand      total_uc300    
 Min.   :  0.00   Min.   :  0.00   Min.   : 0.0000   Min.   :  0.00  
 1st Qu.: 16.70   1st Qu.: 12.20   1st Qu.: 0.0000   1st Qu.: 15.25  
 Median : 50.99   Median : 31.27   Median : 0.0000   Median : 59.00  
 Mean   : 48.73   Mean   : 37.54   Mean   : 0.5794   Mean   : 88.85  
 3rd Qu.: 77.78   3rd Qu.: 57.71   3rd Qu.: 0.0000   3rd Qu.:126.75  
 Max.   :100.00   Max.   :100.00   Max.   :32.8947   Max.   :767.00  
  total_uc1000     total_uc250        total_uc50       total_ucN1000   
 Min.   :  0.00   Min.   : 0.0000   Min.   :0.000000   Min.   :  0.00  
 1st Qu.: 11.00   1st Qu.: 0.0000   1st Qu.:0.000000   1st Qu.: 16.00  
 Median : 25.50   Median : 0.0000   Median :0.000000   Median : 60.00  
 Mean   : 33.12   Mean   : 0.7403   Mean   :0.002584   Mean   : 89.59  
 3rd Qu.: 46.00   3rd Qu.: 0.0000   3rd Qu.:0.000000   3rd Qu.:127.75  
 Max.   :245.00   Max.   :42.0000   Max.   :1.000000   Max.   :767.00  
  pct_ucN1000       pct_uc300        pct_uc1000       pct_uc250      
 Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.0000  
 1st Qu.: 39.68   1st Qu.: 38.67   1st Qu.: 12.20   1st Qu.: 0.0000  
 Median : 67.03   Median : 65.91   Median : 31.27   Median : 0.0000  
 Mean   : 60.78   Mean   : 60.17   Mean   : 37.54   Mean   : 0.6114  
 3rd Qu.: 87.35   3rd Qu.: 87.02   3rd Qu.: 57.71   3rd Qu.: 0.0000  
 Max.   :100.00   Max.   :100.00   Max.   :100.00   Max.   :32.8947  
  total_urban1     total_urban0      pct_urban1       pct_urban0    
 Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
 1st Qu.:  0.00   1st Qu.: 23.00   1st Qu.:  0.00   1st Qu.: 57.27  
 Median :  9.00   Median : 64.00   Median : 11.95   Median : 86.45  
 Mean   : 25.27   Mean   : 97.45   Mean   : 25.61   Mean   : 72.71  
 3rd Qu.: 33.00   3rd Qu.:141.00   3rd Qu.: 38.44   3rd Qu.:100.00  
 Max.   :324.00   Max.   :894.00   Max.   :100.00   Max.   :100.00  

3.4.5.18 visualise wp_nga distribution :: “is_urban”

tmap_mode("view")
tmap mode set to interactive viewing
urban1 <- tm_shape(bdy_nga)+
  tm_polygons(alpha = 0.1) +
tm_shape(wp_nga) +  
  tm_dots(col = "pct_urban1",
          border.col = "gray60",
          border.lwd = 0.5) +
  tm_view(set.zoom.limits = c(5,9))

urban0 <- tm_shape(bdy_nga)+
  tm_polygons(alpha = 0.1) +
tm_shape(wp_nga) +  
  tm_dots(col = "pct_urban0",
          border.col = "gray60",
          border.lwd = 0.5) +
  tm_view(set.zoom.limits = c(5,9))

tmap_arrange(urban1, urban0,
             asp=1, 
             ncol=2,
             sync = TRUE)
tmap_mode("plot")
tmap mode set to plotting


3.4.5.19 save and read RDS File :: wp_nga

write_rds(wp_nga,"/jephOstan/ISSS624/class_project/project_2/data/geodata/wp_nga.rds")
wp_nga <- read_rds("/jephOstan/ISSS624/class_project/project_2/data/geodata/wp_nga.rds")

3.4.5.20 transform to Projected Coordinate System

Usage of the code chunk below :

st_crs( ) - sf - to inspect the coordinate reference system.

st_crs(wp_nga)
Coordinate Reference System:
  User input: EPSG:4326 
  wkt:
GEOGCRS["WGS 84",
    ENSEMBLE["World Geodetic System 1984 ensemble",
        MEMBER["World Geodetic System 1984 (Transit)"],
        MEMBER["World Geodetic System 1984 (G730)"],
        MEMBER["World Geodetic System 1984 (G873)"],
        MEMBER["World Geodetic System 1984 (G1150)"],
        MEMBER["World Geodetic System 1984 (G1674)"],
        MEMBER["World Geodetic System 1984 (G1762)"],
        MEMBER["World Geodetic System 1984 (G2139)"],
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]],
        ENSEMBLEACCURACY[2.0]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["Horizontal component of 3D system."],
        AREA["World."],
        BBOX[-90,-180,90,180]],
    ID["EPSG",4326]]

Remarks :

The EPSG for wp_nga is 4326, which is WGS 84. To compute the proximity distance matrix for clustering analysis, this coordinate reference system needs to transform into EPSG: 26391.

Usage of the code chunk below :

st_set_crs( ) - sf - to update the coordinate reference system.

wp_ngaTrans <- st_set_crs(wp_nga, 26391)
Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
that
bdy_ngaTrans <- st_set_crs(bdy_nga, 26391)
Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
that

-- review CRS :: wp_ngaTrans

st_crs(wp_ngaTrans)
Coordinate Reference System:
  User input: EPSG:26391 
  wkt:
PROJCRS["Minna / Nigeria West Belt",
    BASEGEOGCRS["Minna",
        DATUM["Minna",
            ELLIPSOID["Clarke 1880 (RGS)",6378249.145,293.465,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4263]],
    CONVERSION["Nigeria West Belt",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",4,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",4.5,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.99975,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",230738.26,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Engineering survey, topographic mapping."],
        AREA["Nigeria - onshore west of 6°30'E, onshore and offshore shelf."],
        BBOX[3.57,2.69,13.9,6.5]],
    ID["EPSG",26391]]

-- review CRS :: bdy_ngaTrans

st_crs(bdy_ngaTrans)
Coordinate Reference System:
  User input: EPSG:26391 
  wkt:
PROJCRS["Minna / Nigeria West Belt",
    BASEGEOGCRS["Minna",
        DATUM["Minna",
            ELLIPSOID["Clarke 1880 (RGS)",6378249.145,293.465,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4263]],
    CONVERSION["Nigeria West Belt",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",4,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",4.5,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.99975,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",230738.26,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Engineering survey, topographic mapping."],
        AREA["Nigeria - onshore west of 6°30'E, onshore and offshore shelf."],
        BBOX[3.57,2.69,13.9,6.5]],
    ID["EPSG",26391]]

3.5 Exploratory Data Analysis

3.5.1 Identify Outliers

3.5.1.1 plot boxplot “pct_functional”

ggplot(data=wp_ngaTrans, 
       aes(x=`pct_functional`)) +
  geom_boxplot(color="black", 
               fill="#543005")

3.5.1.2 plot boxplot “pct_nonFunctional”

ggplot(data=wp_ngaTrans, 
       aes(x=`pct_nonFunctional`)) +
  geom_boxplot(color="black", 
               fill="#C16622FF")

3.5.1.3 plot boxplot “pct_unknown”

ggplot(data=wp_ngaTrans, 
       aes(x=`pct_unknown`)) +
  geom_boxplot(color="black", 
               fill="#FFA319FF")

Remarks :

Among these 3 key categories of “status_clean”, “unknown” has the most outliers.

3.5.2 Multi-plot Histogram

3.5.2.1 plot histogram for “status_clean”

pctFunctional <- ggplot(data = wp_ngaTrans,
                         aes(x = `pct_functional`))+
  geom_histogram(bins=10,
                 colour = "black",
                 fill = "#543005")

pctNonFunctional <- ggplot(data = wp_ngaTrans,
                         aes(x = `pct_nonFunctional`))+
  geom_histogram(bins=10,
                 colour = "black",
                 fill = "#C16622FF")

pctUnknown <- ggplot(data = wp_ngaTrans,
                     aes(x = `pct_unknown`))+
  geom_histogram(bins = 10,
                 colour = "black",
                 fill = "#FFA319FF")
ggarrange(pctFunctional,pctNonFunctional,pctUnknown,
          ncol = 2,
          nrow = 2)
adding dummy grobs

4. CORRELATION ANALYSIS

4.1 Create Data Table for Correlation Matrix Analysis

cluster_vars <- wp_ngaTrans %>%
  st_set_geometry(NULL) %>%
  select("shapeName",
         "pct_functional", 
         "pct_nonFunctional",
         "pct_unknown", 
         "pct_handPump",
         "pct_mechPump",
         "pct_tapStand",
         "pct_uc300",
         "pct_uc1000",
         "pct_ucN1000",
         "pct_uc250",
         "pct_urban0")
head(cluster_vars,5)
  shapeName pct_functional pct_nonFunctional pct_unknown pct_handPump
1 Aba North       41.17647          52.94118    5.882353    11.764706
2 Aba South       40.84507          46.47887    9.859155     9.859155
3    Abadam        0.00000           0.00000    0.000000     0.000000
4     Abaji       40.35088          59.64912    0.000000    40.350877
5      Abak       47.91667          50.00000    0.000000     8.333333
  pct_mechPump pct_tapStand pct_uc300 pct_uc1000 pct_ucN1000 pct_uc250
1     82.35294            0 17.647059   82.35294   17.647059         0
2     87.32394            0 12.676056   87.32394   12.676056         0
3      0.00000            0  0.000000    0.00000    0.000000         0
4     59.64912            0 40.350877   59.64912   40.350877         0
5     91.66667            0  8.333333   91.66667    8.333333         0
  pct_urban0
1   0.000000
2   5.633803
3   0.000000
4  84.210526
5  83.333333

4.2 Visualise Correlation Matrix

Usage of the code chunk below :

corrplot.mixed( ) - corrplot - to use mixed methods to visualise a correlation matrix.

This plot allows to identify the pattern and the relationship in the matrix.

corrplot.mixed((cor(cluster_vars[,2:12])),
               upper = "number",
               lower = "ellipse",
               tl.col = "black",
               diag = "l",
               tl.pos = "lt")

Remarks :

Following are the pairs with strong correlation :

correlation coefficients variable_1 variable_2
1.00 pct_mechPump pct_uc1000
0.99 pct_tapStand pct_uc250
0.99 pct_uc300 pct_ucN1000
-0.91 pct_mechPump pct_ucN1000
-0.91 pct_uc1000 pct_ucN1000
-0.90 pct_mechPump pct_uc300
-0.90 pct_uc300 pct_uc1000

4.2.1 Replace Row ID with “shapeName”

row.names(cluster_vars) <- cluster_vars$shapeName

4.2.2 Trim High Correlation Variable and “shapeName”

cluster_varsTrim <- cluster_vars %>%
  select(-shapeName, -pct_ucN1000, -pct_mechPump)

4.2.2.1 review trimmed data table

summary(cluster_varsTrim)
 pct_functional   pct_nonFunctional  pct_unknown      pct_handPump   
 Min.   :  0.00   Min.   :  0.00    Min.   :  0.00   Min.   :  0.00  
 1st Qu.: 32.61   1st Qu.: 20.77    1st Qu.:  0.00   1st Qu.: 16.70  
 Median : 47.41   Median : 34.89    Median :  0.00   Median : 50.99  
 Mean   : 49.84   Mean   : 35.58    Mean   : 12.55   Mean   : 48.73  
 3rd Qu.: 66.99   3rd Qu.: 50.00    3rd Qu.: 20.83   3rd Qu.: 77.78  
 Max.   :100.00   Max.   :100.00    Max.   :100.00   Max.   :100.00  
  pct_tapStand       pct_uc300        pct_uc1000       pct_uc250      
 Min.   : 0.0000   Min.   :  0.00   Min.   :  0.00   Min.   : 0.0000  
 1st Qu.: 0.0000   1st Qu.: 38.67   1st Qu.: 12.20   1st Qu.: 0.0000  
 Median : 0.0000   Median : 65.91   Median : 31.27   Median : 0.0000  
 Mean   : 0.5794   Mean   : 60.17   Mean   : 37.54   Mean   : 0.6114  
 3rd Qu.: 0.0000   3rd Qu.: 87.02   3rd Qu.: 57.71   3rd Qu.: 0.0000  
 Max.   :32.8947   Max.   :100.00   Max.   :100.00   Max.   :32.8947  
   pct_urban0    
 Min.   :  0.00  
 1st Qu.: 57.27  
 Median : 86.45  
 Mean   : 72.71  
 3rd Qu.:100.00  
 Max.   :100.00  

5. CLUSTERING ANALYSIS

5.1 Hierarchy Clustering

There are four (4) main steps :

  • compute proximity matrix.
  • assign data point to a cluster.
  • merge clusters based on similarity between clusters.
  • update the distance matrix.

5.1.1 Standardise Data

As shown in the 4.2.3.1, there are few variables with Max. different from others. Hence, standardisation will be required prior to further analysis.

5.1.1.1 standardise with min-max method

nga_wpStd <- normalize(cluster_varsTrim)
summary(nga_wpStd)
 pct_functional   pct_nonFunctional  pct_unknown      pct_handPump   
 Min.   :0.0000   Min.   :0.0000    Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.3261   1st Qu.:0.2077    1st Qu.:0.0000   1st Qu.:0.1670  
 Median :0.4741   Median :0.3489    Median :0.0000   Median :0.5099  
 Mean   :0.4984   Mean   :0.3558    Mean   :0.1255   Mean   :0.4873  
 3rd Qu.:0.6699   3rd Qu.:0.5000    3rd Qu.:0.2083   3rd Qu.:0.7778  
 Max.   :1.0000   Max.   :1.0000    Max.   :1.0000   Max.   :1.0000  
  pct_tapStand       pct_uc300        pct_uc1000       pct_uc250      
 Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.00000   1st Qu.:0.3867   1st Qu.:0.1220   1st Qu.:0.00000  
 Median :0.00000   Median :0.6591   Median :0.3127   Median :0.00000  
 Mean   :0.01761   Mean   :0.6017   Mean   :0.3754   Mean   :0.01859  
 3rd Qu.:0.00000   3rd Qu.:0.8702   3rd Qu.:0.5771   3rd Qu.:0.00000  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
   pct_urban0    
 Min.   :0.0000  
 1st Qu.:0.5727  
 Median :0.8645  
 Mean   :0.7271  
 3rd Qu.:1.0000  
 Max.   :1.0000  

5.1.1.2 standardise with Z-score method

nga_wpZ <- scale(cluster_varsTrim)
describe(nga_wpZ)
                  vars   n mean sd median trimmed  mad   min   max range  skew
pct_functional       1 774    0  1  -0.10   -0.02 1.04 -2.06  2.07  4.13  0.14
pct_nonFunctional    2 774    0  1  -0.03   -0.02 1.05 -1.71  3.10  4.81  0.23
pct_unknown          3 774    0  1  -0.62   -0.22 0.00 -0.62  4.30  4.92  2.01
pct_handPump         4 774    0  1   0.07    0.00 1.37 -1.49  1.57  3.06 -0.09
pct_tapStand         5 774    0  1  -0.19   -0.19 0.00 -0.19 10.46 10.65  7.22
pct_uc300            6 774    0  1   0.19    0.08 1.10 -2.02  1.34  3.35 -0.56
pct_uc1000           7 774    0  1  -0.21   -0.09 1.05 -1.28  2.14  3.42  0.61
pct_uc250            8 774    0  1  -0.20   -0.20 0.00 -0.20 10.37 10.57  7.10
pct_urban0           9 774    0  1   0.42    0.17 0.62 -2.23  0.84  3.06 -1.12
                  kurtosis   se
pct_functional       -0.62 0.04
pct_nonFunctional    -0.42 0.04
pct_unknown           4.15 0.04
pct_handPump         -1.33 0.04
pct_tapStand         58.65 0.04
pct_uc300            -0.87 0.04
pct_uc1000           -0.78 0.04
pct_uc250            57.14 0.04
pct_urban0           -0.09 0.04

5.1.1.3 visualise distribution of standardised clustering variable

-- functional water point

fwp <- ggplot(data=cluster_varsTrim, 
             aes(x= `pct_functional`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="steelblue") +
  ggtitle("Before Standardisation")


fwp_stdDf <- as.data.frame(nga_wpStd)
fwp_std <- ggplot(data=fwp_stdDf, 
       aes(x=`pct_functional`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="steelblue") +
  ggtitle("Min-Max Stdsn.")

fwp_zDf <- as.data.frame(nga_wpZ)
fwp_z <- ggplot(data=fwp_zDf, 
       aes(x=`pct_functional`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="steelblue") +
  ggtitle("Z-score Stdsn.")

ggarrange(fwp, fwp_std, fwp_z,
          ncol = 3,
          nrow = 1)

fwp <- ggplot(data=cluster_varsTrim, 
             aes(x= `pct_functional`)) +
  geom_density(color="black", 
                 fill="steelblue") +
  ggtitle("Before Standardisation")


fwp_stdDf <- as.data.frame(nga_wpStd)
fwp_std <- ggplot(data=fwp_stdDf, 
       aes(x=`pct_functional`)) +
  geom_density(color="black", 
                 fill="steelblue") +
  ggtitle("Min-Max Stdsn.")

fwp_zDf <- as.data.frame(nga_wpZ)
fwp_z <- ggplot(data=fwp_zDf, 
       aes(x=`pct_functional`)) +
  geom_density(color="black", 
                 fill="steelblue") +
  ggtitle("Z-score Stdsn.")

ggarrange(fwp, fwp_std, fwp_z,
          ncol = 3,
          nrow = 1)

-- water point deployed with handpump

HP <- ggplot(data=cluster_varsTrim, 
             aes(x= `pct_handPump`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="steelblue") +
  ggtitle("Before Standardisation")


fwp_stdDf <- as.data.frame(nga_wpStd)
HP_std <- ggplot(data=fwp_stdDf, 
       aes(x=`pct_handPump`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="steelblue") +
  ggtitle("Min-Max Stdsn.")

fwp_zDf <- as.data.frame(nga_wpZ)
HP_z <- ggplot(data=fwp_zDf, 
       aes(x=`pct_handPump`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="steelblue") +
  ggtitle("Z-score Stdsn.")

ggarrange(HP, HP_std, HP_z,
          ncol = 3,
          nrow = 1)

HP <- ggplot(data=cluster_varsTrim, 
             aes(x= `pct_handPump`)) +
  geom_density(color="black", 
                 fill="steelblue") +
  ggtitle("Before Standardisation")


fwp_stdDf <- as.data.frame(nga_wpStd)
HP_std <- ggplot(data=fwp_stdDf, 
       aes(x=`pct_handPump`)) +
  geom_density(color="black", 
                 fill="steelblue") +
  ggtitle("Min-Max Stdsn.")

fwp_zDf <- as.data.frame(nga_wpZ)
HP_z <- ggplot(data=fwp_zDf, 
       aes(x=`pct_handPump`)) +
  geom_density(color="black", 
                 fill="steelblue") +
  ggtitle("Z-score Stdsn.")

ggarrange(HP, HP_std, HP_z,
          ncol = 3,
          nrow = 1)

-- water point with 1000 users usage capacity

uc1000 <- ggplot(data=cluster_varsTrim, 
             aes(x= `pct_uc1000`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="steelblue") +
  ggtitle("Before Standardisation")

fwp_stdDf <- as.data.frame(nga_wpStd)
uc1000_std <- ggplot(data=fwp_stdDf, 
       aes(x=`pct_uc1000`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="steelblue") +
  ggtitle("Min-Max Stdsn.")

fwp_zDf <- as.data.frame(nga_wpZ)
uc1000_z <- ggplot(data=fwp_zDf, 
       aes(x=`pct_uc1000`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="steelblue") +
  ggtitle("Z-score Stdsn.")

ggarrange(uc1000, uc1000_std, uc1000_z,
          ncol = 3,
          nrow = 1)

uc1000 <- ggplot(data=cluster_varsTrim, 
             aes(x= `pct_uc1000`)) +
  geom_density(color="black", 
                 fill="steelblue") +
  ggtitle("Before Standardisation")


fwp_stdDf <- as.data.frame(nga_wpStd)
uc1000_std <- ggplot(data=fwp_stdDf, 
       aes(x=`pct_uc1000`)) +
  geom_density(color="black", 
                 fill="steelblue") +
  ggtitle("Min-Max Stdsn.")

fwp_zDf <- as.data.frame(nga_wpZ)
uc1000_z <- ggplot(data=fwp_zDf, 
       aes(x=`pct_uc1000`)) +
  geom_density(color="black", 
                 fill="steelblue") +
  ggtitle("Z-score Stdsn.")

ggarrange(uc1000, uc1000_std, uc1000_z,
          ncol = 3,
          nrow = 1)

5.1.2 Compute Proximity Matrix

Usage of the code chunk below :

dist( ) - stats - to compute the proximity distance matrix. Among euclidean, maximum, manhattan, canberra, binary and minkowski, euclidean is used to compute proxmat_euc.

proxmat_euc <- dist(cluster_varsTrim, method = 'euclidean')

5.1.3 Compute Hierarchical Clustering

Usage of the code chunk below :

hclust( ) - stats - to compute cluster with agglomeration method.

ggdendrogram( ) - ggdendro - to plot dendrogram with tools available in ggplot2.

hieClust_warD <- hclust(proxmat_euc, method = 'ward.D')
ggdendrogram(hieClust_warD, 
             rotate = TRUE, 
             size = 2, 
             theme_dendro = FALSE)

5.1.4 Determine Optimal Clustering Algorithm

Usage of the code chunk below :

agnes( ) - cluster - to get agglomerative coefficient of 4 clustering structure, namely “average”, “single”, “complete” and “ward”.

m <- c( "average", "single", "complete", "ward")

names(m) <- c( "average", "single", "complete", "ward")

ac <- function(x) {
  agnes(cluster_varsTrim, method = x)$ac
  }

map_dbl(m, ac)
  average    single  complete      ward 
0.9264460 0.8825086 0.9494033 0.9923235 

Remarks :

  • Value 1 indicate strongest clustering structure.

  • Ward’s method provides the strongest clustering structure. Therefore, Ward’s method to be used in subsequent analysis.

5.1.5 Determine Optimal Clusters

To determine the optimal clusters to retain, following commons methods are tested :

  • Gap statistic

  • Elbow

  • Average Silhouette

5.1.5.1 compute Gap Statistic method

Usage of the code chunk below :

clusGap( ) - cluster - to compute the gap statistic.

set.seed(12345)
gap_stat <- clusGap(cluster_varsTrim, 
                    FUN = hcut, 
                    nstart = 25, 
                    K.max = 30, 
                    B = 50)
# Print the result
print(gap_stat, method = "firstmax")
Clustering Gap statistic ["clusGap"] from call:
clusGap(x = cluster_varsTrim, FUNcluster = hcut, K.max = 30, B = 50, nstart = 25)
B=50 simulated reference sets, k = 1..30; spaceH0="scaledPCA"
 --> Number of clusters (method 'firstmax'): 30
          logW    E.logW       gap      SE.sim
 [1,] 9.815280 10.315513 0.5002330 0.008043258
 [2,] 9.602465 10.203144 0.6006791 0.008510149
 [3,] 9.510126 10.144007 0.6338806 0.010041659
 [4,] 9.443727 10.094054 0.6503272 0.009408325
 [5,] 9.337773 10.055036 0.7172630 0.008377048
 [6,] 9.286151 10.021049 0.7348986 0.008061617
 [7,] 9.226030  9.992116 0.7660864 0.007509762
 [8,] 9.181055  9.967079 0.7860239 0.007406854
 [9,] 9.132662  9.944999 0.8123364 0.007505962
[10,] 9.088576  9.924941 0.8363644 0.008003248
[11,] 9.057435  9.906340 0.8489044 0.008086779
[12,] 9.019733  9.888898 0.8691655 0.008378669
[13,] 8.988798  9.872948 0.8841499 0.008499708
[14,] 8.962331  9.857905 0.8955736 0.008609360
[15,] 8.932159  9.843557 0.9113980 0.008574353
[16,] 8.908477  9.830133 0.9216564 0.008442998
[17,] 8.880801  9.817233 0.9364313 0.008235439
[18,] 8.846775  9.805149 0.9583744 0.008068772
[19,] 8.828254  9.793671 0.9654167 0.007975742
[20,] 8.812263  9.782502 0.9702393 0.007904523
[21,] 8.793736  9.771938 0.9782012 0.007903082
[22,] 8.777957  9.761659 0.9837022 0.007927244
[23,] 8.762944  9.751686 0.9887413 0.007838986
[24,] 8.745719  9.741903 0.9961837 0.007850884
[25,] 8.732706  9.732508 0.9998020 0.007792150
[26,] 8.716858  9.723358 1.0064996 0.007813310
[27,] 8.703095  9.714414 1.0113191 0.007684052
[28,] 8.684688  9.705770 1.0210819 0.007586041
[29,] 8.664250  9.697408 1.0331579 0.007592467
[30,] 8.649964  9.689233 1.0392698 0.007607051

-- visualise gap_stat

Usage of the code chunk below :

fviz_nbclust( ) - factoextra - to compute and visualise the Optimal Number of clusters.

set.seed(12345)
fviz_nbclust(nga_wpZ, 
             kmeans, 
             nstart = 25,  
             method = "gap_stat", 
             nboot = 50)+
  labs(subtitle = "Gap statistic method")
Warning: did not converge in 10 iterations

Warning: did not converge in 10 iterations

Warning: did not converge in 10 iterations

5.1.5.2 compute and visualise Elbow method

fviz_nbclust(nga_wpZ, kmeans, method = "wss") +
    geom_vline(xintercept = 4, linetype = 2)+
  labs(subtitle = "Elbow method")

5.1.5.3 compute and visualise Silhouette method

fviz_nbclust(nga_wpZ, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette method")

Remarks :

Given the Elbow method, Silhouette method and Gap Statistic method, the 5-cluster by Silhouette method will be used for the rest of the study.

5.1.5.4 interpret with Dendrogram

Usage of the code chunk below :

rect.hclust( ) - stats - to draw the dendrogram with a border around the selected clusters.

plot(hieClust_warD, cex = 0.6)
rect.hclust(hieClust_warD, 
            k = 5, 
            border = 2:5)

5.1.6 Visually-Driven Hierarchical Clustering Analysis

The data is loaded into a data frame, but it has to be a data matrix to plot the heatmap. Hence, the data frame will need to first transform into a matrix.

5.1.6.1 transform data frame into matrix

Usage of the code chunk below :

data.matrix( ) - base - to transform cluster_varsTrim data frame into a data matrix, and named it as nga_clustMat.

nga_clustMat <- data.matrix(cluster_varsTrim)

5.1.6.2 plot interactive cluster heatmap

Usage of the code chunk below :

heatmaply( ) - heatmaply - to build an interactive cluster heatmap.

heatmaply(normalize(nga_clustMat),
          Colv=NA,
          dist_method = "euclidean",
          hclust_method = "ward.D",
          seriate = "OLO",
          colors = Blues,
          k_row = 5,
          margins = c(NA,200,60,NA),
          fontsize_row = 4,
          fontsize_col = 5,
          main="Geographic Segmentation of Nigeria by Water Points",
          xlab = "Water Points",
          ylab = "Nigeria LGA"
          )

Remarks :

Based on the plot above, 5 clusters to be retained for further analysis.

5.1.6.3 map the formed cluster

Usage of the code chunk below :

cutree( ) - base - to derive a 5-cluster model, and named the output as groups.

groups <- as.factor(cutree(hieClust_warD, k=5))

5.1.6.4 append groups to wp_ngaTrans

nga_clust.sf <- cbind(wp_ngaTrans, as.matrix(groups)) %>%
  rename(`cluster`=`as.matrix.groups.`)

5.1.6.5 plot choropleth map :: nga_clust.sf

qtm(nga_clust.sf, "cluster")

Remarks :

The choropleth map above shows the fragmented clusters by the used of non-spatial clustering algorithm (hierarchical cluster analysis method).

5.2 Spatially Constrained Clustering :: SKATER Approach

SKATER function only support sp objects in SpatialPolygonDataFrame. Hence, the wp_ngaTrans has to first transform into SpatialPolygonDataFrame before proceed further.

5.2.1 Convert SF to SP Data Frame

Usage of the code chunk below :

as_Spatial( ) - sf - to convert wp_ngaTrans into nga_sp in a SP data frame.

nga.sp <- as_Spatial(wp_ngaTrans)

5.2.2 Compute Neighbour List

Usage of the code chunk below :

poly2nb( ) - spdep - to compute the neighbours list from polygon list.

nga.nb <- poly2nb(nga.sp, queen = TRUE)
summary(nga.nb)
Neighbour list object:
Number of regions: 774 
Number of nonzero links: 4440 
Percentage nonzero weights: 0.7411414 
Average number of links: 5.736434 
1 region with no links:
86
Link number distribution:

  0   1   2   3   4   5   6   7   8   9  10  11  12  14 
  1   2  14  57 125 182 140 122  72  41  12   4   1   1 
2 least connected regions:
138 560 with 1 link
1 most connected region:
508 with 14 links

Remarks :

There is one (1) region, i.e. #86 is without link. It has to be removed first before proceed to plot the neighbours list.

5.2.2.1 remove 0-neighbour region

wp_ngaTrans1 <- wp_ngaTrans[-86,]
cluster_varsTrim1 <- cluster_varsTrim[-86,]
nga_clust.sf1 <- nga_clust.sf[-86,]
nga_wpZ1 <- nga_wpZ[-86,]
nga.sp1 <- as_Spatial(wp_ngaTrans1)

nga.nb1 <- poly2nb(nga.sp1)
summary(nga.nb1)
Neighbour list object:
Number of regions: 773 
Number of nonzero links: 4440 
Percentage nonzero weights: 0.7430602 
Average number of links: 5.743855 
Link number distribution:

  1   2   3   4   5   6   7   8   9  10  11  12  14 
  2  14  57 125 182 140 122  72  41  12   4   1   1 
2 least connected regions:
138 560 with 1 link
1 most connected region:
508 with 14 links

5.2.2.2 plot Neighbour List by Centroid Node

Usage of the code chunk below : plot the boundary first before the neighbour list object to avoid any region from being clipped away.

plot(nga.sp1, 
     border=grey(.5))
plot(nga.nb1, 
     coordinates(nga.sp1), 
     col="blue", 
     add=TRUE)

5.2.3 Compute Minimum Spanning Tree (MST)

5.2.3.1 calculate edge costs

Usage of the code chunk below :

nbcosts( ) - spdep - to compute the cost of each edge which is the distance between nodes.

edge_cost <- nbcosts(nga.nb1, cluster_varsTrim1)

5.2.3.2 specify spatial weight

nb2listw( ) - spdep - to specify edge_cost as the spatial weights. Set the “style” to “B” to ensure the cost values are not row-standardised.

nga.w <- nb2listw(nga.nb1,
                  edge_cost,
                  style = "B")
Warning in nb2listw(nga.nb1, edge_cost, style = "B"): zero sum general weights
summary(nga.w)
Characteristics of weights list object:
Neighbour list object:
Number of regions: 773 
Number of nonzero links: 4440 
Percentage nonzero weights: 0.7430602 
Average number of links: 5.743855 
Link number distribution:

  1   2   3   4   5   6   7   8   9  10  11  12  14 
  2  14  57 125 182 140 122  72  41  12   4   1   1 
2 least connected regions:
138 560 with 1 link
1 most connected region:
508 with 14 links

Weights style: B 
Weights constants summary:
    n     nn       S0       S1        S2
B 773 597529 245120.7 38463020 406298492

5.2.3.3 compute minimum spanning tree

Usage of the code chunk below :

nbcosts( ) - spdep - to compute the minimum spanning tree.

nga_minSpanT <- mstree(nga.w)

-- review class and dimension of the computed MST

class(nga_minSpanT)
[1] "mst"    "matrix"
dim(nga_minSpanT)
[1] 772   3
head(nga_minSpanT)
     [,1] [,2]      [,3]
[1,]  474  387 134.66287
[2,]  387  478  64.55618
[3,]  387  439  78.95255
[4,]  439  476  50.91751
[5,]  439  270 105.68114
[6,]  270   90  66.67241

5.2.3.4 plot MST Neighbour List

plot(nga.sp1, border=gray(.5))
plot.mst(nga_minSpanT,
         coordinates(nga.sp1), 
         col="blue", 
         cex.lab=0.7, 
         cex.circles=0.005, 
         add=TRUE)

5.2.4 Compute Spatially Constrained Cluster

Usage of the code chunk below :

skater( ) - spdep - to compute the spatially constrained cluster.

clust5 <- spdep::skater(edges = nga_minSpanT[,1:2],
                        data = cluster_varsTrim1,
                        method = "euclidean",
                        ncuts = 4)
str(clust5)
List of 8
 $ groups      : num [1:773] 3 3 1 5 4 1 2 2 1 3 ...
 $ edges.groups:List of 5
  ..$ :List of 3
  .. ..$ node: num [1:309] 773 747 492 131 382 224 413 488 439 257 ...
  .. ..$ edge: num [1:308, 1:3] 131 382 224 413 257 767 439 704 476 75 ...
  .. ..$ ssw : num 17013
  ..$ :List of 3
  .. ..$ node: num [1:129] 597 315 316 557 195 571 339 744 205 213 ...
  .. ..$ edge: num [1:128, 1:3] 315 316 557 195 571 15 82 579 744 351 ...
  .. ..$ ssw : num 7874
  ..$ :List of 3
  .. ..$ node: num [1:85] 364 10 729 215 337 551 102 103 66 19 ...
  .. ..$ edge: num [1:84, 1:3] 23 536 578 103 19 375 727 617 188 103 ...
  .. ..$ ssw : num 4545
  ..$ :List of 3
  .. ..$ node: num [1:39] 550 202 330 287 374 732 537 586 733 201 ...
  .. ..$ edge: num [1:38, 1:3] 612 586 136 245 332 429 504 537 586 616 ...
  .. ..$ ssw : num 1294
  ..$ :List of 3
  .. ..$ node: num [1:211] 67 510 401 122 24 526 475 489 663 303 ...
  .. ..$ edge: num [1:210, 1:3] 67 549 510 119 639 401 556 122 693 70 ...
  .. ..$ ssw : num 10380
 $ not.prune   : NULL
 $ candidates  : int [1:5] 1 2 3 4 5
 $ ssto        : num 52660
 $ ssw         : num [1:5] 52660 48724 44268 42679 41106
 $ crit        : num [1:2] 1 Inf
 $ vec.crit    : num [1:773] 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "class")= chr "skater"

5.2.4.1 tabulate cluster assignment

ccs5 <- clust5$groups
table(ccs5)
ccs5
  1   2   3   4   5 
309 129  85  39 211 

5.2.4.2 plot the pruned tree

plot(nga.sp1, border=gray(.5))
plot(clust5, 
     coordinates(nga.sp1), 
     cex.lab=.7,
     groups.colors=c("red","green","blue", "brown", "pink"),
     cex.circles=0.005, 
     add=TRUE)
Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
coords[id2, : "add" is not a graphical parameter

Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
coords[id2, : "add" is not a graphical parameter

Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
coords[id2, : "add" is not a graphical parameter

Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
coords[id2, : "add" is not a graphical parameter

Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
coords[id2, : "add" is not a graphical parameter

5.2.5 Visualise SKATER Clusters in Choropleth Map

groups_mat <- as.matrix(clust5$groups)

nga_spClust.sf <- cbind(nga_clust.sf1, as.factor(groups_mat)) %>%
  rename(`sp_cluster`=`as.factor.groups_mat.`)

To compare the output of hierarchical clustering and spatially constrained hierarchical clustering :

hieClust_map <- qtm(nga_clust.sf1,
                  "cluster") + 
  tm_borders(alpha = 0.5) 

ngaClust_map <- qtm(nga_spClust.sf,
                   "sp_cluster") + 
  tm_borders(alpha = 0.5) 

tmap_arrange(hieClust_map, ngaClust_map,
             asp=NA, ncol=2)
Warning: One tm layer group has duplicated layer types, which are omitted. To
draw multiple layers of the same type, use multiple layer groups (i.e. specify
tm_shape prior to each of them).

Warning: One tm layer group has duplicated layer types, which are omitted. To
draw multiple layers of the same type, use multiple layer groups (i.e. specify
tm_shape prior to each of them).

5.3 Spatially Constrained Clustering :: ClustGeo Method

5.3.1 Perform Ward-like Hierarchical Clustering

Usage of the code chunk below :

hclustgeo( ) - ClustGeo - to perform a typical Ward-like hierarchical clustering.

proxmat_ngc <- dist(cluster_varsTrim1, method = 'euclidean')
nonGeo_clust <- hclustgeo(proxmat_ngc)
plot(nonGeo_clust, cex = 0.5)
rect.hclust(nonGeo_clust, 
            k = 5, 
            border = 2:5)

5.3.1.1 visualise the formed clusters

groups_ngc <- as.factor(cutree(nonGeo_clust, k=5))
nga_ngeo_clust.sf <- cbind(wp_ngaTrans1, as.matrix(groups_ngc)) %>%
  rename(`cluster` = `as.matrix.groups_ngc.`)
qtm(nga_ngeo_clust.sf, "cluster")

5.3.2 Perform Spatially Constrained Hierarchical Clustering

Usage of the code chunk below :

st_distance( ) - sf - to derive the spatial distance matrix before perform spatially constrained hierarchical clustering.

as.dist( ) - stats - to convert the data frame into matrix.

dist <- st_distance(wp_ngaTrans1, wp_ngaTrans1)
dist_mat <- as.dist(dist)

5.3.2.1 determine alpha value

choicealpha( ) - psych - to determine a suitable value for the mixing parameter alpha.

cr <- choicealpha(
  proxmat_ngc, 
  dist_mat, 
  range.alpha = seq(0, 1, 0.1), 
  K=5, 
  graph = TRUE)

Remarks :

With reference to the plot above, alpha = 0.4 to be used to perform spatially constrained hierarchical clustering.

5.3.2.2 compute spatially constrained hierarchical clustering

clustG <- hclustgeo(proxmat_ngc, 
                    dist_mat, 
                    alpha = 0.4)

5.3.2.3 derive cluster object

groups_cg <- as.factor(cutree(clustG, k=5))

5.3.2.4 combine group_cg with wp_ngaTrans1

wp_nga1_GClust <- cbind(wp_ngaTrans1, as.matrix(groups_cg)) %>%
  rename(`cluster` = `as.matrix.groups_cg.`)

5.3.2.5 plot delineated spatially constrained cluster

qtm(wp_nga1_GClust, "cluster")


5.4 Visual Interpretation of Clusters

5.4.1 Visualise Individual Clustering Variable

5.4.1.1 plot boxplot

ggplot(data = nga_ngeo_clust.sf,
       aes(x = cluster, y = pct_functional)) +
  geom_boxplot()

Remarks :

The boxplot reveals Cluster 5 displays the highest mean of functional water points. This is followed by Cluster 1, 3, 2, and 4.

5.4.2 Visualise Multivariate

Usage of the code chunk below :

ggparcoord( ) - GGally - to reveal clustering variables by cluster.

nga_ngeo_clust.sf1 <- nga_ngeo_clust.sf %>%
  select("shapeName", 
         "pct_functional", 
         "pct_nonFunctional", 
         "pct_unknown", 
         "pct_handPump", 
         "pct_tapStand", 
         "pct_uc300", 
         "pct_uc1000", 
         "pct_uc250", 
         "pct_urban0",
         "cluster")
         
head(nga_ngeo_clust.sf1,3)
Simple feature collection with 3 features and 11 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 7.307433 ymin: 5.052192 xmax: 13.83477 ymax: 13.71406
Projected CRS: Minna / Nigeria West Belt
  shapeName pct_functional pct_nonFunctional pct_unknown pct_handPump
1 Aba North       41.17647          52.94118    5.882353    11.764706
2 Aba South       40.84507          46.47887    9.859155     9.859155
3    Abadam        0.00000           0.00000    0.000000     0.000000
  pct_tapStand pct_uc300 pct_uc1000 pct_uc250 pct_urban0 cluster
1            0  17.64706   82.35294         0   0.000000       1
2            0  12.67606   87.32394         0   5.633803       1
3            0   0.00000    0.00000         0   0.000000       1
                        geometry
1 MULTIPOLYGON (((7.401109 5....
2 MULTIPOLYGON (((7.334479 5....
3 MULTIPOLYGON (((13.83477 13...
ggparcoord(data = nga_ngeo_clust.sf,
           columns = c(2:19),
           scale = "globalminmax",
           alphaLines = 0.2,
           boxplot = TRUE, 
           title = "Multiple Parallel Coordinates Plots of Variables by Cluster") +
  facet_grid(~ cluster, scales = "fixed") + 
  theme(axis.text.x = element_text(angle = 30))

The parallel coordinate plot above reveals that households in Cluster 4 townships tend to own the highest number of TV and mobile-phone. On the other hand, households in Cluster 5 tends to own the lowest of all the five ICT.

Note that the scale argument of ggparcoor() provide several methods to scale the clustering variables. They are:

  • std: univariately, subtract mean and divide by standard deviation.

  • robust: univariately, subtract median and divide by median absolute deviation.

  • uniminmax: univariately, scale so the minimum of the variable is zero, and the maximum is one.

  • globalminmax: no scaling is done; the range of the graphs is defined by the global minimum and the global maximum.

  • center: use uniminmax to standardize vertical height, then center each variable at a value specified by the scaleSummary param.

  • centerObs: use uniminmax to standardize vertical height, then center each variable at the value of the observation specified by the centerObsID param

5.4.3 Compute Summary Statistics

nga_ngeo_clust.sf %>% 
  st_set_geometry(NULL) %>%
  group_by(cluster) %>%
  summarise(mean_pct_functional = mean(pct_functional),
            mean_pct_nonFunctional = mean(pct_nonFunctional),
            mean_pct_unknown = mean(pct_unknown),
            mean_pct_handPump = mean(pct_handPump), 
            mean_pct_tapStand = mean(pct_tapStand), 
            mean_pct_uc300 = mean(pct_uc300), 
            mean_pct_uc1000 = mean(pct_uc1000), 
            mean_pct_uc250 = mean(pct_uc250), 
            mean_pct_urban0 = mean(pct_urban0))
# A tibble: 5 × 10
  cluster mean_pct_fun…¹ mean_…² mean_…³ mean_…⁴ mean_…⁵ mean_…⁶ mean_…⁷ mean_…⁸
  <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 1                 51.7    29.0   9.25    35.0   0.525     44.1    45.5  0.539 
2 2                 46.2    42.2  11.3     59.9   1.00      70.6    28.3  1.03  
3 3                 40.5    49.4   9.15     9.87  0.304     19.2    80.4  0.345 
4 4                 18.4    14.6  67.0     18.1   0.337     72.1    27.5  0.410 
5 5                 78.9    20.7   0.295   88.7   0.0677    89.2    10.7  0.0903
# … with 1 more variable: mean_pct_urban0 <dbl>, and abbreviated variable names
#   ¹​mean_pct_functional, ²​mean_pct_nonFunctional, ³​mean_pct_unknown,
#   ⁴​mean_pct_handPump, ⁵​mean_pct_tapStand, ⁶​mean_pct_uc300, ⁷​mean_pct_uc1000,
#   ⁸​mean_pct_uc250